Scalable Distributed Query Processing in Parallel Main-Memory Database Systems
Abstract
The continuous increase in compute speed and main-memory capacity of modern servers has triggered the development of a new generation of in-memory database systems. These systems redesigned the traditional database architecture to use main memory as primary storage. Discarding several now-obsolete abstractions of disk-based database systems enabled unprecedented query performance on a single server. However, network communication slows down queries as soon as multiple servers are involved. The result is a significant performance gap between local and distributed query processing. Still, scaling out to a cluster becomes inevitable when the workload exceeds the capacity of a single server. This thesis seeks to advance the state of the art of distributed query processing in parallel main-memory database systems by addressing the performance barrier introduced by network communication. Rather than concentrating on an isolated problem, we design a novel distributed query engine that adapts to the available network bandwidth as well as to unexpected workload characteristics that hinder scalability. It exploits locality to speed up query processing over commodity networks and implements a novel parallelism model to fully leverage modern high-speed interconnects. We show the feasibility of our design with a prototype implementation for the high-performance in-memory database system HyPer. Using redo log multicasting and global transaction-consistent snapshots, the engine further enables query processing on fresh transactional data. An extensive evaluation with the well-known TPC-H analytical benchmark demonstrates that HyPer with our novel distributed query engine not only outperforms competing parallel database systems but also scales its query performance with the cluster size.
Related resources
AMOS-SDDS: A Scalable Distributed Data Manager for Windows Multicomputers
Known parallel DBMSs at present offer only static partitioning schemes. Adding a storage node is then a cumbersome operation that typically requires manual data redistribution. We present an architecture termed AMOS-SDDS for a shared-nothing multicomputer. We have coupled a high-performance main-memory DBMS AMOS-II and a manager of Scalable Distributed Data Structures (SDDS) into a scalable d...
Making XML Database Systems Scalable to Computer Resources and Data Volumes
Increasing use of XML has emphasized the need for scalable database systems that are capable of handling a large amount of XML data efficiently. This study explores effective methods for making a scalable XML database system in the following aspects: (a) scalability to data volumes, (b) scalable XML processing with a shared-nothing PC cluster, and (c) scalable database processing on shared-memo...
Effective Spatial Data Partitioning for Scalable Query Processing
Recently, MapReduce-based spatial query systems have emerged as a cost-effective and scalable solution to large-scale spatial data processing and analytics. MapReduce-based systems achieve massive scalability by partitioning the data and running query tasks on those partitions in parallel. Therefore, effective data partitioning is critical for task parallelization, load balancing, and directly ...
Tuning a Parallel Database Algorithm on a Shared-memory Multiprocessor
Database query processing can benefit significantly from parallelism. Parallel database algorithms combine substantial CPU and I/O activity, memory requirements, and massive data exchange between processes, all of which must be considered to obtain optimal performance. Since parallel external sorting is a very typical example, we have focused on sorting to tune Volcano, a new query processing s...
Distributed Graph Layout for Scalable Small-world Network Analysis
The in-memory graph layout or organization has a considerable impact on the time and energy efficiency of distributed memory graph computations. It affects memory locality, inter-task load balance, communication time, and overall memory utilization. Graph layout could refer to partitioning or replication of vertex and edge arrays, selective replication of data structures that hold meta-data, an...
Publication date: 2016